Analyzing Movie Reviews

Posted on Sun 23 September 2018 in Data Analysis

Movie Reviews & Descriptive Statistics

The dataset was put together to help detect bias in movie review sites. Each of these sites has two types of score: a user score, which aggregates user reviews, and a critic score, which aggregates professional critics' reviews of the movie.

The dataset contains information on most movies from 2014 and 2015 and was used to help the team at FiveThirtyEight explore Fandango's suspiciously high ratings.

The goal is to identify suspiciously high ratings using descriptive statistics.

In [1]:
import pandas as pd
movies = pd.read_csv("fandango_score_comparison.csv")
In [2]:
movies.head()
Out[2]:
FILM RottenTomatoes RottenTomatoes_User Metacritic Metacritic_User IMDB Fandango_Stars Fandango_Ratingvalue RT_norm RT_user_norm ... IMDB_norm RT_norm_round RT_user_norm_round Metacritic_norm_round Metacritic_user_norm_round IMDB_norm_round Metacritic_user_vote_count IMDB_user_vote_count Fandango_votes Fandango_Difference
0 Avengers: Age of Ultron (2015) 74 86 66 7.1 7.8 5.0 4.5 3.70 4.3 ... 3.90 3.5 4.5 3.5 3.5 4.0 1330 271107 14846 0.5
1 Cinderella (2015) 85 80 67 7.5 7.1 5.0 4.5 4.25 4.0 ... 3.55 4.5 4.0 3.5 4.0 3.5 249 65709 12640 0.5
2 Ant-Man (2015) 80 90 64 8.1 7.8 5.0 4.5 4.00 4.5 ... 3.90 4.0 4.5 3.0 4.0 4.0 627 103660 12055 0.5
3 Do You Believe? (2015) 18 84 22 4.7 5.4 5.0 4.5 0.90 4.2 ... 2.70 1.0 4.0 1.0 2.5 2.5 31 3136 1793 0.5
4 Hot Tub Time Machine 2 (2015) 14 28 29 3.4 5.1 3.5 3.0 0.70 1.4 ... 2.55 0.5 1.5 1.5 1.5 2.5 88 19560 1021 0.5

5 rows × 22 columns

Histograms

In [39]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(movies['Fandango_Stars'])
plt.show()
plt.hist(movies['Metacritic_norm_round'])
plt.show()

In the 'Metacritic_norm_round' column, ratings are spread across the scale, whereas the 'Fandango_Stars' ratings all fall between 3.0 and 5.0.

Mean, Median, And Standard Deviation

In [40]:
mean = movies[['Fandango_Stars','Metacritic_norm_round']].mean()
median = movies[['Fandango_Stars','Metacritic_norm_round']].median()
std = movies[['Fandango_Stars','Metacritic_norm_round']].std()
print(mean, median, std)
Fandango_Stars           4.089041
Metacritic_norm_round    2.972603
dtype: float64 Fandango_Stars           4.0
Metacritic_norm_round    3.0
dtype: float64 Fandango_Stars           0.540386
Metacritic_norm_round    0.990961
dtype: float64

Fandango vs Metacritic Methodology

Fandango appears to inflate ratings and isn't transparent about how it calculates and aggregates them. Metacritic publishes each individual critic rating and is transparent about how it aggregates them into a final rating.

Fandango vs Metacritic differences

The median Metacritic score (3.0) is slightly higher than the mean Metacritic score (2.97), which suggests that a few very low reviews "drag down" the mean. The median Fandango score (4.0) is lower than the mean Fandango score (4.09), which suggests that a few very high ratings "drag up" the mean.

Fandango ratings appear clustered between 3 and 5, so their standard deviation (0.54) is smaller than that of the Metacritic ratings (0.99), which go from 0 to 5.
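To make the two effects above concrete, here is a minimal sketch using only the standard library. The two lists are made-up ratings, not values from the dataset: one is clustered in a narrow band, the other is spread out with a couple of very low values. `statistics.stdev` is the sample standard deviation, which is also what pandas' `.std()` computes by default.

```python
from statistics import mean, median, stdev

# Hypothetical ratings for illustration only (NOT taken from the dataset).
clustered = [3.5, 4.0, 4.0, 4.0, 4.5, 4.5, 5.0]  # narrow range, a few high values
spread = [0.5, 1.0, 3.0, 3.0, 3.5, 4.0, 4.5]     # wide range, a few low values

for name, ratings in [("clustered", clustered), ("spread", spread)]:
    # A few extreme values pull the mean away from the median;
    # a wider range of values gives a larger standard deviation.
    print(name,
          "mean:", round(mean(ratings), 2),
          "median:", median(ratings),
          "std:", round(stdev(ratings), 2))
```

In the clustered list the mean ends up above the median, and in the spread list it ends up below, which is the same pattern seen in the Fandango and Metacritic columns.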

Fandango ratings in general appear to be higher than Metacritic ratings.

Fandango's main business is selling movie tickets, so it could have an incentive to bias its ratings in order to sell more tickets. That would also explain why it calculates its ratings in a hidden way.

Scatter Plots: detecting outlier ratings

In [41]:
plt.scatter(movies['Fandango_Stars'],movies['Metacritic_norm_round'])
plt.show()
In [42]:
import numpy as np

movies['fm_diff'] = movies['Fandango_Stars'] - movies['Metacritic_norm_round']
movies['fm_diff'] = np.absolute(movies['fm_diff'])
In [56]:
movies_sorted = movies.sort_values(by="fm_diff", ascending = False)
movies_sorted[['FILM','Fandango_Stars','Metacritic_norm_round']].head()
Out[56]:
FILM Fandango_Stars Metacritic_norm_round
3 Do You Believe? (2015) 5.0 1.0
85 Little Boy (2015) 4.5 1.5
47 Annie (2014) 4.5 1.5
19 Pixels (2015) 4.5 1.5
134 The Longest Ride (2015) 4.5 1.5

We computed the rating differences between Fandango and Metacritic, took their absolute values, and sorted them in descending order to select the largest outliers.
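One thing the absolute value hides is the direction of the gap. The sketch below repeats the same ranking in plain Python on the five rows shown in the `movies.head()` output, and keeps the signed difference alongside the absolute one. The variable names (`rows`, `ranked`) are only illustrative.

```python
# (film, Fandango_Stars, Metacritic_norm_round), copied from the
# movies.head() output earlier in this post.
rows = [
    ("Avengers: Age of Ultron (2015)", 5.0, 3.5),
    ("Cinderella (2015)", 5.0, 3.5),
    ("Ant-Man (2015)", 5.0, 3.0),
    ("Do You Believe? (2015)", 5.0, 1.0),
    ("Hot Tub Time Machine 2 (2015)", 3.5, 1.5),
]

# Signed difference: positive means Fandango rates the film higher.
with_diff = [(film, fandango - metacritic) for film, fandango, metacritic in rows]

# Rank by the size of the gap, largest first (same idea as fm_diff).
ranked = sorted(with_diff, key=lambda item: abs(item[1]), reverse=True)

for film, diff in ranked:
    print(f"{film}: {diff:+.1f}")
```

For these five rows every signed difference is positive, so the gap always goes in Fandango's favour. Checking the sign on the full dataset would show whether that holds in general.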

Correlations between Fandango & Metacritic Ratings

In [47]:
from scipy.stats import pearsonr

r, p_value = pearsonr(movies['Fandango_Stars'], movies['Metacritic_norm_round'])
print(r)
0.178449190739

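The low correlation between Fandango and Metacritic scores suggests that Fandango scores aren't simply inflated versions of Metacritic scores, but are fundamentally different. If Fandango does inflate ratings, the inflation seems to vary from movie to movie according to criteria that aren't visible in this data.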
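The reasoning here relies on a property of the Pearson coefficient: adding the same amount to every score does not change it. The following sketch computes r by hand (the helper `pearson_r` is written for this illustration and is not part of SciPy) and applies it to made-up scores with a constant one-star inflation.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical scores (NOT from the dataset): every rating inflated by one star.
critic_scores = [1.0, 2.0, 3.0, 4.0]
inflated_scores = [score + 1.0 for score in critic_scores]

r_uniform = pearson_r(critic_scores, inflated_scores)
print(r_uniform)  # very close to 1, despite the inflation
```

If Fandango simply added a fixed bonus to each Metacritic score, r would be close to 1. An observed r of about 0.18 is therefore hard to explain by uniform inflation alone.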

Linear Regression based on Metacritic Score

In [48]:
from scipy.stats import linregress

slope, intercept, r_value, p_value, stderr_slope = linregress(movies['Metacritic_norm_round'], movies['Fandango_Stars'])

predicted_y_fandango = slope * 3 + intercept

print(predicted_y_fandango)
4.09170715282

A movie rated 3 on Metacritic would be predicted to have a rating of about 4.1 on Fandango.
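For readers curious about what `linregress` does to get the slope and intercept, here is a minimal least-squares sketch in plain Python. The helper `fit_line` and the sample points are made up for illustration. A least-squares line always passes through the point (mean of x, mean of y), which is why a Metacritic score of 3, very close to the Metacritic mean of 2.97, gives a prediction very close to the Fandango mean of 4.09.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical points (NOT from the dataset).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 2.5, 3.5, 4.0]

slope, intercept = fit_line(xs, ys)

# Predicting at the mean of x gives back the mean of y.
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
prediction_at_mean = slope * mean_x + intercept
print(slope, intercept, prediction_at_mean)
```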

Linear Regression and Scatter plot

Plotting the regression line on top of the scatter plot helps visualize how the line relates to the existing data points.

In [54]:
predicted_y_1 = slope * 1.0 + intercept
predicted_y_5 = slope * 5.0 + intercept

plt.scatter(movies["Metacritic_norm_round"], movies["Fandango_Stars"])
plt.plot([1,5],[predicted_y_1,predicted_y_5])
plt.xlim(1,5)

plt.show()